Whole-genome sequencing of 13 Arctic plants and draft genomes of Oxyria digyna and Cochlearia groenlandica

To understand the genomic characteristics of Arctic plants, we generated 28–44 Gb of short-read sequencing data from 13 Arctic plants collected from the High Arctic Svalbard. We successfully estimated the genome sizes of eight species by using the k-mer-based method (180–894 Mb). Among these plants, the mountain sorrel (Oxyria digyna) and Greenland scurvy grass (Cochlearia groenlandica) had relatively small genome sizes and chromosome numbers. We obtained 45 × and 121 × high-fidelity long-read sequencing data. We assembled their reads into high-quality draft genomes (genome size: 561 and 250 Mb; contig N50 length: 36.9 and 14.8 Mb, respectively), and correspondingly annotated 43,105 and 29,675 genes using ~46 and ~85 million RNA sequencing reads. We identified 765,012 and 88,959 single-nucleotide variants, and 18,082 and 7,698 structural variants (variant size ≥ 50 bp). This study provided high-quality genome assemblies of O. digyna and C. groenlandica, which are valuable resources for the population and molecular genetic studies of these plants.

Cochlearia is also spoonwort.Some Cochlearia plants contain enough vitamin C and another common name, "scurvy-grass, " reflects its use as a traditional remedy for scurvy, a disease caused by the deficiency of vitamin C. Salt and heavy metal tolerance has been reported in some Cochlearia plants 15 .
Cochlearia groenlandica is a biennial, occasionally short-lived, perennial herb with a wide distribution in Greenland, Svalbard, Iceland, Alaska, Canada, and Russia 16 .It grows in various habitats, including gravelly and sandy plains, sediment plains, moss tundra, patterned grounds, seashores, and bird-cliff meadows.It can grow relatively well regardless of the soil pH or nutrient level and can survive and reproduce under harsh conditions as a stunted form 17 .The size of individual plants and leaves varies; they can be ten times larger in nutrient-rich areas, such as below bird cliffs.Polar bears graze on C. groenlandica at the foot of a large seabird colony on a cliff in Spitsbergen, Svalbard 18 .Chromosome counting confirmed the haploid chromosome number of C. groenlandica at x = 7 19,20 .There are two populations of C. groenlandica in Iceland with genetic and morphological differences.The alpine population is genetically and morphologically similar to those in Greenland and Svalbard, whereas the coastal population is different from the Arctic population 20 .Similar to O. digyna, C. groenlandica has no reference genome or limited genomic resources, which limits the genetic understanding of this Arctic plant.
In this study, we provide short-read sequencing data of 13 Arctic plants as well as high-quality draft genome sequences of O. digyna and C. groenlandica generated using the high-fidelity (HiFi) long-read sequencing technology of Pacific Biosciences (PacBio) (Fig. 1).We assessed the assembly quality and annotated the genetic variants, repetitive sequences, and genes in these genome assemblies (Fig. 2).These are the first draft genomes of the genera, Oxyria and Cochlearia, based on long-read sequencing data.This will be a valuable resource for understanding the genomic characteristics of Arctic plants and the population structures of O. digyna and C. groenlandica.

Methods
plant sampling.Arctic plant samples were collected from Spitsbergen Island in the Svalbard Archipelago between July 25 and August 3, 2021.Thirteen dominant species in Svalbard were selected for genome size estimation, including Eriophorum scheuchzeri ssp.arcticum, Papaver dahlianum, Silene acaulis, Silene uralensis ssp.arctica, Bistorta vivipara, Oxyria digyna, Saxifraga oppositifolia, Cochlearia groenlandica, Salix polaris, Dryas octopetala, Betula nana ssp.nana, Cassiope tetragona, and Polemonium boreale (Fig. 1).Only the indicated number of leaves required for analysis (less than 2.0 g) was collected from each plant to minimize its impact on the local population.The collected samples were immediately placed in yellow envelopes and dried in the field using silica gel.The plant leaves were lyophilized in envelopes at the Dasan Arctic Research Station and stored in 15-mL conical tubes at 4 °C until they were transferred to Korea.
DNA extraction and illumina sequencing.DNA was extracted using the GeneAll ® ExgeneTM Plant SV Mini Kit (GeneAll Biotechnology).DNA extraction was performed by slightly modifying the manufacturer protocol.The concentration and quality of the extracted DNA were checked using a NanoPhotometer ® NP80 (Implen GmbH), and the DNA quality was checked by electrophoresis.
Illumina sequencing was performed by DNALink (https://dnalink.com/).The DNA library for sequencing was produced using an Illumina DNA Sample Prep Kit, and the completed DNA library was sequenced with paired-end reads of 151 bp using the Illumina NovaSeq 6000 platform.DNA sequencing data (28-44 Gb) were obtained (Table 1) 21 , which were suitable for estimating genome sizes of approximately 1 Gb.
Genome size estimation.KAT (version 2.4.1) was used to estimate the genome size of 13 Svalbard plants 22 .
The command kat hist -m 27 was applied to all prepared Illumina sequencing data.The genome sizes of the eight species ranged from 180 Mb to 894 Mb (Table 1).It is noteworthy that if their genomes are highly repetitive, the true genome size could be larger than the estimated size, although the order of magnitude would be similar.We were unable to determine the genome sizes of Bistorta vivipara, Papaver dahlianum, Polemonium boreale, Silene acaulis, and Saxifraga oppositifolia.We suspect that the genome size of these species may have exceeded 1 Gb.Because the genome sizes of O. digyna and C. groenlandica were estimated to be relatively small, ~352 and ~180 Mb, respectively, and their haploid (n) chromosome numbers were only seven, we focused on constructing the draft genomes of these plant species.identification of telomeric repeats.We also attempted to annotate the telomeric repeats of each species using the short-read sequencing data, because telomeric repeats can be changed in plants and other species-rich phylums [23][24][25][26][27][28][29][30][31][32][33] .As telomeric repeats should be found as sequential and continuous repetitive sequences in a single read, we subsampled 20 million reads from the Illumina reads of the 13 Svalbard plants and only used 60-bp regions of each read by trimming off the initial 10 bp and the last 81 bp.We then counted k-mers of sequentially repetitive units that were similar to the canonical plant telomeric repeat, TTTAGGG (Arabidopsis-type) 26 .
DNA and RNA sequencing analyses of Oxyria digyna and Cochlearia groenlandica for de novo genome assembly.All the experiments described in this section were performed by NICEM (https://nicem.snu.ac.kr).DNA was extracted using the cetyltrimethylammonium bromide (CTAB) method 34 .The library was prepared using the SMRTbell template prep kit 2.0 and sequenced on a PacBio Sequel IIe sequencer using the Sequel II sequencing kit 2.0.The computed data were subjected to the Circular Consensus Sequencing (CCS) protocol of the SMRT Link (ver.11.0.0.146107) to derive the HiFi data (CCS reads of Q20 or higher).Based on the estimated genome sizes, we generated 25.5 Gb (1.5 M reads, 17 kb on average) and 30.2Gb (1.6 M reads, 18 kb on average) of HiFi long-read sequencing data for O. digyna and C. groenlandica (Table 1 and Fig. 3b) 35,36 .HiFi reads were utilized to re-estimate their genome sizes using a k-mer-based method, KMC (version 3.2.1;kmc -k21 -ci1 -cs1000000 and kmc_tools transform histogram -cx1000000) and GenomeScope2 (version 2.0; genomescope.R --max_kmercov 1000000 -k 21), with parameters optimized for highly repetitive genomes [37][38][39][40][41] .We estimated that C. groenlandica has ~275 Mb with a 0.15% heterozygosity and that O. digyna has ~561 Mb with a 0.43% heterozygosity (Fig. 3c,d).These estimated genome sizes differ from the initial ones estimated by KAT, but are much closer to the genome assembly sizes (see the next section).
Total RNA was extracted using the GeneAll Hybrid-RTM kit and RNA quality was measured using an Agilent Bioanalyzer RNA Nanochip.The library was prepared using the NEXTflex Rapid Directional RNA-Seq Bundle Kit (Perkin Elmer).qPCR was performed to quantify libraries capable of forming clusters and RNA sequencing was performed on the Illumina NovaSeq 6000 platform.We obtained 14 and 25.7 Gb (92.6 M and 170.3 M 151-bp paired-end reads) of RNA sequencing data for O. digyna and C. groenlandica, respectively (Table 1) 35,36 .De novo genome assembly and quality assessment using repetitive sequences and evolutionary conserved genes.For de novo genome assembly using HiFi sequencing data, data processing was conducted as described previously 42 .Specifically, we assembled 45 × and 121 × HiFi raw reads of O. digyna and C. groenlandica, respectively, into contigs using hifiasm (version 0.16.1;default option for O. digyna and hifiasm -s 0.51 for C. groenlandica) and converted two GFA-formatted output files to FASTA-formatted files 43,44 .Previously known angiosperm repetitive sequences were masked using RepeatMasker (version 4.1.2.p1; RepeatMasker -species angiosperms -s) and identified O. digyna-and C. groenlandica-specific repetitive sequences using RepeatModeler (version 2.0.3;BuildDatabase and RepeatModeler -database -LTRStruct -ninja_dir) to mask species-specific repeats (version 4.1.2.p1; RepeatMasker -lib -s) 45 .
We further annotated repeats that were not classified by RepeatModeler using a transposon classification tool, RFSB (version 1.0; transposon_classifier_RFSB -mode classify -fastaFile -outputPredictionFile) 46 .The Arabidopsis type telomeric repeat, TTTAGGG, was searched among all ends of the contigs using RepeatMasker outputs, and the telomeric repeat clusters were manually validated 26 .To determine whether repeat density affected assembly quality, we calculated read depths and repeat compositions of 100-kb binned intervals of contigs ( ≥ 1 Mb for C. groenlandica and ≥ 10 Mb for O. digyna were used).We divided these intervals into the less repetitive or more repetitive groups using median repeat compositions of the intervals (median repetitive ratio: 0.57 for C. groenlandica and 0.69 for O. digyna).Their raw read depths were visualized as violin plots.BUSCO values were calculated using BUSCO and its Eudicot database (version 5.2.2; busco -m genome --auto-lineage-euk) 47,48 .Genome assembly quality values (QVs) and completeness were calculated based on k-mer using Merqury (version 1.3; meryl k=21 count output and merqury.shwith default option) 49 .
De novo genome assemblies using HiFi reads represented two partially phased haplotypes (primary and alternative) for both O. digyna and C. groenlandica (Fig. 4a and Table 2) 50,51 4a).These genome assembly sizes were concordant with the estimated genome sizes under the assumption of highly repetitive genomes (Fig. 3c,d).
Indeed, 69.0% (388 Mb) and 68.5% (374 Mb) of O. digyna and 61.9% (155 Mb) and 69.8% (118 Mb) of C. groenlandica were masked as repetitive sequences (Fig. 4b and Table 3).Approximately 96% (371 and 361 Mb) of O. digyna and 90% (142 and 104 Mb) of C. groenlandica repetitive sequences were classified as interspersed repeats, including DNA transposons and retroelements.As repetitive regions are difficult to be well assembled, we analyzed whether more repetitive regions exhibit more skewed read depth distributions by comparing read depths of less and more repetitive regions of the genome assemblies (Fig. 4c).In C. groenlandica, median read depths were similar in the two more and less repetitive regions (108.37  most repetitive regions were well resolved (Fig. 4c).However, some regions with high repeat density exhibited significantly high read depth (up to 6 times than the median), suggesting that some highly repetitive regions collapsed in our genome assembly (Fig. 4c).In O. digyna, the two regions showed much more similar read depth distributions and much lower variances than those of C. groenlandica (median: 45.04 for low and 44.00 for high; standard deviation: 5.02 for low and 5.34 for high).It indicates that repetitive regions in the O. digyna assembly were resolved as similar as non-repetitive regions.Moreover, the complete single-copy and complete duplicated BUSCO values of the assemblies were 93.3% and 93.2%, respectively, for the two O. digyna genome assemblies (eudicots_odb10) and 90.4% for the primary genome assembly of C. groenlandica (brassicales_odb10) (Fig. 4d), suggesting their high contiguity.However, the alternative genome assembly of C. groenlandica exhibited a complete BUSCO value of only 51.4%.This implies that it has many missing parts in its genome, concordant with its smaller genome size than that of the primary genome assembly (250 Mb vs. 170 Mb).These genome assemblies were also assessed using k-mer-based QVs and completeness.O. digyna genome assemblies exhibited over QV60 and 92% completeness (Fig. 4d).In contrast, the primary genome assembly of "uniq" represents the ratio of unique sequences; "aa" represents the homozygosity rate; "ab" represents the heterozygosity rate; "kcov" represents the coverage for heterozygous k-mers; "err" represents the read error rate; "dup" represents the read duplication rate; "k" represents k-mer size used; "p" represents the ploidy level.
C. groenlandica exhibited over QV60 and 87% completeness, but its alternative genome assembly exhibited only QV56 and 51% completeness (Fig. 4d).By merging primary and alternative genome assemblies for each species, their QVs were close to QV60, and their completeness was approximately 98% (Fig. 4d).It is noteworthy that the QVs were calculated using raw HiFi reads and HiFi-based genome assemblies, so the values could be overestimated, as they were not independently generated.
Overall, these metrics imply that our genome assemblies are highly accurate at the base level and mostly represent the full genome, except for the alternative genome assembly of C. groenlandica.We recommend only the primary genome assembly of C. groenlandica.These genome assemblies could be further scaffolded at the chromosome level in the near future with the advancement of other long-read and long-range sequencing technologies.
We identified 43,105 and 29,675 possible protein-coding genes in O. digyna and C. groenlandica, respectively, of which 33,134 and 27,381 were searched in the UniProt database (Table 4).By mapping raw HiFi reads onto our reference genome, we identified 765,012 and 88,959 heterozygous SNPs comprising 0.14% and 0.04% of the total genomic length, respectively (Table 4).The effect of SNPs on genes were categorized as follows: 1,001 high-impact, 22,592 moderate-impact, and 15,594 low-impact SNPs for O. digyna and 253 high-impact, 6,536 moderate-impact, and 6,373 low-impact SNPs for C. groenlandica (Table 4).We found 9,574 deletion SVs, 8,302 insertion SVs, 26 inversion SVs, 97 interspersed duplication SVs, and 83 tandem duplication SVs in O. digyna and 4,079 deletion SVs, 3,577 insertion SVs, 5 inversion SVs, 10 interspersed duplication SVs, and 27 tandem duplication SVs in C. groenlandica (Table 4 and Fig. 4e).Coding sequences in 1,104 and 328 genes of O. digyna and C. groenlandica, respectively, were affected by SVs.Of these, 747 and 147 genes had known homologs (Table 4).circos visualization.The numbers of genes, genetic variants, and repetitive sequences were summed for every 100 kb bin by utilizing the output files of BRAKER2, DeepVariant, SVIM-asm, and RepeatMasker.HiFi read depths for every position were obtained by applying SAMtools to the intermediate BAM files of the DeepVariant analysis (version 1.13; samtools depth -a) and were summed for every 100 kb bin 52 .We visualized the gene, genetic variant, repeat, and read depth densities using Circos (version v 0.69-8; circos -conf configurationFile.txt;plot type = histogram for read depths and plot type heatmap for gene, genetic variant, and repeat densities) 68 (Fig. 2).

Data Records
All sequencing reads generated in this study have been deposited in the NCBI Sequence Read Archive under accession numbers, SRP404573, SRP427161, and SRP468445 21,35,36 .The assemblies have been deposited in GenBank under the accession numbers, GCA_029168935.1 and GCA_040259375.1 50,51 .Genome sequences, gene annotation, and amino acid and coding sequences of annotated genes are available at figshare database 69 .Homology search and variant impact analysis for genes and repeat classification data are also available at figshare 69 .

technical Validation
Illumina DNA sequencing reads and PacBio HiFi DNA sequencing reads were produced > 25 Gb.HiFi reads showed qualified read length distributions (mean read length: 17,289 and 18,429 bp for Oxyria digyna and Cochlearia groenlandica, respectively).Assembly quality was assessed using contig length and BUSCO.The N50 lengths of the primary assemblies were greater than 10 Mb, and the longest contig lengths were 79 and 37 Mb, respectively.BUSCO values exceeded 90%.

Fig. 4
Fig. 4 (a) Cumulative contig length distributions.The primary assembly sizes were used for the genome size for the two species to calculate NG50 (vertical dotted black line).(b) Repetitive sequence compositions in the genome assemblies.Numbers represent the lengths of each repetitive category.(c) Read depth distributions of less and more repetitive regions in C. groenlandica and O. digyna genome assemblies.Each dot represents the read depth of each 100-kb binned interval.(d) BUSCO values of each genome assembly (stacked bar graph).Numbers above the bars represent k-mer-based QV (upper) and completeness (lower) values.The top numbers represent QVs and completeness of merged genome assemblies for each species.(e) Length distributions of structural variants identified in C. groenlandica (upper panel) and O. digyna (lower panel) genome assemblies.
. The two genome assemblies had genome sizes of 561 Mb and 546 Mb for O. digyna and 250 Mb and 170 Mb for C. groenlandica.The longest contig lengths were 79.5 and 55.8 Mb for O. digyna and 36.4 and 28.2 Mb for C. groenlandica, and N50 lengths were 36.9 and 29.4 Mb for O. digyna and 14.8 and 9.0 Mb for C. groenlandica, respectively (Fig.

Table 1 .
and 102.58, respectively), indicating that Species information and sequencing data summary of 13 Svalbard plants.ND, Not determined; NA, Not applicable; Chromosome numbers and ploidy data were obtained from Svalbard Flora (https://svalbardflora.no/).

Table 2 .
Summary statistics of draft genome assemblies.

Table 4 .
Summary of numbers of genes and genetic variants.